UVOD

Tema našeg projekta je analiza “Velikih 5 ličnosti”, takozvani “Big Five Personality test”. Podatke koje imamo nastali su ispitivanjem više od 300 000 ljudi diljem svijeta pomoću poznatog psihološkog testa koji je dostupan svima na internetu. Na temelju tog testa zaključe se koeficijenti 5 ličnosti. Test je napravljen tako da tih 5 ličnosti, u našem slučaju 5 koeficijenata, nisu povezani. 5 ličnosti koje se ispituju su: 1) otvorenost prema novim iskstvima 2) neuroticizam 3) savjesnost prema poslu i obvezama 4) ugodnost 5) ekstorverzija

OPIS SKUPA PODATAKA

#ucitavanje potrebnih paketa

library(dplyr)

#Učitavanje podataka

Učitajmo podatke iz .csv file-a

bigFiveData = read.csv("../big_five_scores.csv")
dim(bigFiveData)
## [1] 307313      9

Podaci se sastoje od 307 313 testiranih ljudi i 9 varijabli koje promatramo.

Popis varijabli koje promatramo:

names(bigFiveData)
## [1] "case_id"                 "country"                
## [3] "age"                     "sex"                    
## [5] "agreeable_score"         "extraversion_score"     
## [7] "openness_score"          "conscientiousness_score"
## [9] "neuroticism_score"

Za testirane sudionike u tablici su navedeni njihovi podaci (godina, zemlja, spol) te 5 faktora koje promtramo(ekstraverzija, ugodnost, savjesnost, neuroticizam i otvorenost)

Možemo promotriti ponašanje varijabli.

summary(bigFiveData)
##     case_id         country               age             sex       
##  Min.   :     1   Length:307313      Min.   :10.00   Min.   :1.000  
##  1st Qu.: 83653   Class :character   1st Qu.:18.00   1st Qu.:1.000  
##  Median :166286   Mode  :character   Median :22.00   Median :2.000  
##  Mean   :166682                      Mean   :25.19   Mean   :1.602  
##  3rd Qu.:249627                      3rd Qu.:29.00   3rd Qu.:2.000  
##  Max.   :334161                      Max.   :99.00   Max.   :2.000  
##  agreeable_score  extraversion_score openness_score   conscientiousness_score
##  Min.   :0.2000   Min.   :0.2000     Min.   :0.2533   Min.   :0.2067         
##  1st Qu.:0.6400   1st Qu.:0.6000     1st Qu.:0.6733   1st Qu.:0.6300         
##  Median :0.7033   Median :0.6800     Median :0.7367   Median :0.7067         
##  Mean   :0.6968   Mean   :0.6723     Mean   :0.7339   Mean   :0.7020         
##  3rd Qu.:0.7633   3rd Qu.:0.7500     3rd Qu.:0.7967   3rd Qu.:0.7767         
##  Max.   :1.0000   Max.   :0.9933     Max.   :0.9967   Max.   :1.0000         
##  neuroticism_score
##  Min.   :0.1967   
##  1st Qu.:0.4867   
##  Median :0.5700   
##  Mean   :0.5744   
##  3rd Qu.:0.6600   
##  Max.   :0.9967
sapply(bigFiveData, class)
##                 case_id                 country                     age 
##               "integer"             "character"               "integer" 
##                     sex         agreeable_score      extraversion_score 
##               "integer"               "numeric"               "numeric" 
##          openness_score conscientiousness_score       neuroticism_score 
##               "numeric"               "numeric"               "numeric"

Vidimo iz prioloženog tip podataka danih varijabli. Zemlja je zadana stringom, dok su ostale varijable brojčane(int ili numeric). Svi faktori koje promatramo su numeric tipa i imaju vrijednost od 0 do 1.

Sada gledamo postoje li u našem skupu podataka nedostajuće vrijednosti jer one mogu poremetiti rezultate testa. Promatramo sve varijable i brojimo koliko je NA vrijednosti ako ih ima.

for (col_name in names(bigFiveData)){
  if (sum(is.na(bigFiveData[,col_name])) > 0){
    cat('Ukupno nedostajućih vrijednosti za varijablu ',col_name, ': ', sum(is.na(bigFiveData[,col_name])),'\n')
  }
}

Po rezultatima vidimo da nemamo NA vrijednosti(neodostajućih) ni u jednoj varijabli. Podaci su konzistentni.

Sada kad smo pregledali podatke koje imamo, možemo krenuti na testiranje hipoteza.

T test - Ovisnost koeficijenta otvorenosti o starosti ispitanika

Ako grupiramo ispitanike na mlade i stare, možemo li vidjeti značajnu razliku u otvorenosti prema novim iskustvima?

pretpostavka:

H0: Mladi su jednako otvoreni prema novim iskustvima kao i stari.

H1: Mladi su otvoreniji prema novim iskustvima nego stari.

Ovdje ćemo koristiti jednostrani t-test za testiranje.

T-test za prepostavku uzima normalnost podataka, pa prije početka testiranja moramo pokazati normalnost.

Za početak ćemo nacrtati histogram i otprilike vidjeti.

Prije početka testiranja smo vidjeli podatke za godine, te možemo vidjeti da distribucija ne izgleda najbliže normalnoj, pa ćemo za mlade uzeti mlađe od 35, a za stare starije od 35 i vidjeti kako se graf ponaša.

young_people = bigFiveData[which(bigFiveData$age <= 35),]
old_people = bigFiveData[which(bigFiveData$age > 35),]
cat('Prosječna ocjena otvorenosti mladih ljudi iznosi ', mean(young_people$openness_score), '\n')
## Prosječna ocjena otvorenosti mladih ljudi iznosi  0.7340796
cat('Prosječna ocjena otvorenosti starih ljudi iznosi ', mean(old_people$openness_score), '\n')
## Prosječna ocjena otvorenosti starih ljudi iznosi  0.7331564
boxplot(young_people$openness_score, old_people$openness_score,
        names = c('Young people openness score', 'Old people openness score'),
        main = 'Boxplot of young and old people openness score')

Sada provjeravamo normalnost podataka. Najprije crtamo histograme za obje populacije.

hist(young_people$openness_score, main='Young people openness score', xlab='Openness score', ylab='Frequency')

hist(old_people$openness_score, main='Old people openness score', xlab='Openness score', ylab='Frequency')

Iz histograma možemo pretpostaviti da su razdiobe normalne, no radimo i Lilliefors test.

nortest::lillie.test(young_people$openness_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  young_people$openness_score
## D = 0.019024, p-value < 2.2e-16
nortest::lillie.test(old_people$openness_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  old_people$openness_score
## D = 0.024658, p-value < 2.2e-16
qqnorm(young_people$openness_score, main = 'Young people openness score')
qqline(young_people$openness_score, col = 'mediumblue')

qqnorm(old_people$openness_score, main = 'Old people openness score')
qqline(old_people$openness_score, col = 'mediumblue')

Zaključujemo da se radi o normalnim razdiobama.

Zatim provjeravamo jednakost varijanci dvaju uzoraka kako bismo mogli odabrati prikladni test. Radimo test o jednakosti varijanci.

var.test(young_people$openness_score, old_people$openness_score)
## 
##  F test to compare two variances
## 
## data:  young_people$openness_score and old_people$openness_score
## F = 0.95531, num df = 261300, denom df = 46011, p-value = 1.272e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9419809 0.9687558
## sample estimates:
## ratio of variances 
##          0.9553078

Dobili smo da je p-vrijednost jako mala, što znači da možemo odbaciti hipotezu H0 da su varijance jednake.

Provedimo sada t-test na podatcima.

t.test(young_people$openness_score, old_people$openness_score, alt = "greater", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  young_people$openness_score and old_people$openness_score
## t = 2.0512, df = 62481, p-value = 0.02012
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.0001828934          Inf
## sample estimates:
## mean of x mean of y 
## 0.7340796 0.7331564

Iz testa vidimo da je p-vrijednost manja od 0.05 pa možemo odbaciti hipotezu H0 da su mlađi i stariji ljudi jednako otvoreni prema novim iskustvima u korist hipoteze H1 koja kaže da su mlađi ljudi otvoreniji prema novim iskustvima.

T - test - Usporedba koeficijenta ekstraverzije SAD - a i ostatka svijeta

U SAD-u su ljudi ekstravertiraniji nego u ostatku svijeta.

H0: Ljudi u SAD-u imaju jednak koeficijent ekstraverzije kao i u ostatku svijeta. H1: Ljudi u SAD-u imaju veći koeficijent ekstraverzije nego ljudi u ostatku svijeta.

Americans = bigFiveData[which(bigFiveData$country == 'USA'),]
Others = bigFiveData[which(bigFiveData$country != 'USA'),]
cat('Prosječni koeficijent ekstraverzije kod Amerikanaca iznosi ', mean(Americans$extraversion_score), '\n')
## Prosječni koeficijent ekstraverzije kod Amerikanaca iznosi  0.6753184
cat('Prosječni koeficijent ekstraverzije u ostatku svijeta ', mean(Others$extraversion_score), '\n')
## Prosječni koeficijent ekstraverzije u ostatku svijeta  0.6656154
boxplot(Americans$extraversion_score, Others$extraversion_score,
        names = c('Americans extraversion score', 'Other people extraversion score'),
        main = 'Boxplot of extraversion score in USA and other countries')

Sada provjeravamo normalnost podataka, stoga najprije crtamo histogram.

hist(Americans$extraversion_score, main='Americans extraversion score', xlab='Extraversion score', ylab='Frequency')

hist(Others$extraversion_score, main='Others extraversion score', xlab='Extraversion score', ylab='Frequency')

nortest::lillie.test(Americans$extraversion_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  Americans$extraversion_score
## D = 0.034115, p-value < 2.2e-16
nortest::lillie.test(Others$extraversion_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  Others$extraversion_score
## D = 0.034227, p-value < 2.2e-16
qqnorm(Americans$extraversion_score, main = 'Americans extraversion score')
qqline(Others$extraversion_score, col = 'mediumblue')

qqnorm(Others$extraversion_score, main = 'Others extraversion score')
qqline(Others$extraversion_score, col = 'mediumblue')

Iz dosadašnjih rezultata zaključujemo da se radi o normalnim razdiobama.

Zatim provjeravamo jednakost varijanci dvaju uzoraka kako bismo mogli odabrati prikladni test. Radimo test o jednakosti varijanci.

var.test(Americans$extraversion_score, Others$extraversion_score)
## 
##  F test to compare two variances
## 
## data:  Americans$extraversion_score and Others$extraversion_score
## F = 1.0394, num df = 212624, denom df = 94687, p-value = 3.215e-12
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  1.028148 1.050659
## sample estimates:
## ratio of variances 
##           1.039355

Zbog jako male p-vrijednosti možemo odbaciti hipotezu H0 da su varijance jednake.

Naposlijetku provodimo t-test na podacima.

t.test(Americans$extraversion_score, Others$extraversion_score, alt = "greater", var.equal = FALSE)
## 
##  Welch Two Sample t-test
## 
## data:  Americans$extraversion_score and Others$extraversion_score
## t = 23.041, df = 184977, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.009010334         Inf
## sample estimates:
## mean of x mean of y 
## 0.6753184 0.6656154

Vidimo da je p-vrijednost jako mala, zbog čega možemo odbaciti hipotezu H0 i zaključiti u korist H1 da su ljudi u SAD-u ekstravertiraniji od ostatka svijeta.

ANOVA test - Usporedba savjesnosti i populacije u pojedinim regijama

U ovom odjeljku ćemo provesti ANOVA test varijanci na temelju sljedećih hipoteza. \[ \begin{aligned} H_0 & : \mu_A = \mu_E = \mu_N \\ H_1 & : \neg H_0. \end{aligned} \]

Također, nakon provedene jednofaktorske ANOVE provest ćemo i post hoc test i provjeriti tvrdnju da je populacija istočne Azije svjesnija od ostatka populacije.

Učitavamo potrebne pakete za test

library(dplyr)
library(ggplot2)
library(ggpubr)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ tibble  3.1.5     ✓ purrr   0.3.4
## ✓ tidyr   1.1.4     ✓ stringr 1.4.0
## ✓ readr   2.0.2     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(broom)
library(AICcmodavg)
library(tidyverse)
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following object is masked from 'package:purrr':
## 
##     compact
## The following object is masked from 'package:ggpubr':
## 
##     mutate
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
require(nortest)
## Loading required package: nortest

#Grupiranje zemalja u regije

U skladu s gornjim ispisom i brojem rezultata u pojedinim zemljama, odlučili smo pojedine zemlje grupirati u regije i uzeti dio podataka u pojedinim zemljama koje imaju previše podataka u odnosu na druge. Regije su Europe, North America i Asia & Pacific.

Zemlje u pojedinim regijama su: Asia & Pacific - Australia, India, New Zealan, Philippine, Singapore, Thailand Europe - Germany, Finland, Ireland, Netherland, UK North America - Canada, USA

groupedData = select(bigFiveData, c('conscientiousness_score', 'country'))

groupedData$'Region' <- as.factor(ifelse(groupedData$country == "Australia" | groupedData$country == "India" | groupedData$country == "New Zealan" | groupedData$country == "Philippine" | groupedData$country == "Singapore"| groupedData$country == "Thailand", 'Asia & Pacific',
                                  ifelse(groupedData$country == "Germany" | groupedData$country == "Finland" | groupedData$country == "Ireland" | groupedData$country == "Netherland" | groupedData$country == "UK", 'Europe', 
                                  ifelse(groupedData$country == "Canada" | groupedData$country == "USA", 'North America',NA))))

summary(bigFiveData)
##     case_id         country               age             sex       
##  Min.   :     1   Length:307313      Min.   :10.00   Min.   :1.000  
##  1st Qu.: 83653   Class :character   1st Qu.:18.00   1st Qu.:1.000  
##  Median :166286   Mode  :character   Median :22.00   Median :2.000  
##  Mean   :166682                      Mean   :25.19   Mean   :1.602  
##  3rd Qu.:249627                      3rd Qu.:29.00   3rd Qu.:2.000  
##  Max.   :334161                      Max.   :99.00   Max.   :2.000  
##  agreeable_score  extraversion_score openness_score   conscientiousness_score
##  Min.   :0.2000   Min.   :0.2000     Min.   :0.2533   Min.   :0.2067         
##  1st Qu.:0.6400   1st Qu.:0.6000     1st Qu.:0.6733   1st Qu.:0.6300         
##  Median :0.7033   Median :0.6800     Median :0.7367   Median :0.7067         
##  Mean   :0.6968   Mean   :0.6723     Mean   :0.7339   Mean   :0.7020         
##  3rd Qu.:0.7633   3rd Qu.:0.7500     3rd Qu.:0.7967   3rd Qu.:0.7767         
##  Max.   :1.0000   Max.   :0.9933     Max.   :0.9967   Max.   :1.0000         
##  neuroticism_score
##  Min.   :0.1967   
##  1st Qu.:0.4867   
##  Median :0.5700   
##  Mean   :0.5744   
##  3rd Qu.:0.6600   
##  Max.   :0.9967
groupedData = select(groupedData, c('conscientiousness_score', 'Region'))

groupedData = filter(groupedData, (Region == "Asia & Pacific" | Region == "Europe" | Region == "North America" | Region == "Asia & Pacific"))
groupedData = filter(groupedData, Region != 'Other' &  conscientiousness_score != 'Other')
 
groupedData = groupedData[order(groupedData$Region),]
groupedData = groupedData[0:72334,]

summary(groupedData)
##  conscientiousness_score            Region     
##  Min.   :0.2133          Asia & Pacific:22254  
##  1st Qu.:0.6167          Europe        :25080  
##  Median :0.6900          North America :25000  
##  Mean   :0.6893                                
##  3rd Qu.:0.7633                                
##  Max.   :1.0000

#Provjera normalnosti podataka Računamo ANOVU kako bi imali reziduale, te ćemo pomoću njih provjeriti jednakost varijanci i provjeriti normalnost prije provođenja ANOVA

Provjera za savjesnost

lillie.test(groupedData$conscientiousness_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  groupedData$conscientiousness_score
## D = 0.016006, p-value < 2.2e-16
test = groupedData[0:3000,]
lillie.test(test$conscientiousness_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  test$conscientiousness_score
## D = 0.022916, p-value = 0.0009769

Provjera za Aziju

lillie.test(groupedData$conscientiousness_score[groupedData$Region=='Asia & Pacific'])
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  groupedData$conscientiousness_score[groupedData$Region == "Asia & Pacific"]
## D = 0.012807, p-value = 1.345e-08

Provjera za Sjevernu Ameriku

lillie.test(groupedData$conscientiousness_score[groupedData$Region=='North America'])
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  groupedData$conscientiousness_score[groupedData$Region == "North America"]
## D = 0.020484, p-value < 2.2e-16

Provjera za Europu

lillie.test(groupedData$conscientiousness_score[groupedData$Region=='Europe'])
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  groupedData$conscientiousness_score[groupedData$Region == "Europe"]
## D = 0.017663, p-value < 2.2e-16

#Vizualni prikaz normalnosti

hist(groupedData$conscientiousness_score[groupedData$Region=='Europe'])

hist(groupedData$conscientiousness_score[groupedData$Region=='North America'])

hist(groupedData$conscientiousness_score[groupedData$Region=='Asia & Pacific'])

Nakon provedenog Liliforce testa normalnosti, vidimo da podatci nisu normalni. No isto tako, napravili smo test s 3000 uzoraka i p vrijednost testa se značajno povećala, što znači da je Liliforce osjetljiv na količinu podataka i na outliere. No naši outliere se ne smiju izbaciti.

#Sada ćemo testirati homogenost varijanci regija pomoću naivnog Barlettovim testom

bartlett.test(groupedData$conscientiousness_score ~ groupedData$Region)
## 
##  Bartlett test of homogeneity of variances
## 
## data:  groupedData$conscientiousness_score by groupedData$Region
## Bartlett's K-squared = 7.6132, df = 2, p-value = 0.02222
var((groupedData$conscientiousness_score[groupedData$Region=='Europe']))
## [1] 0.01133189
var((groupedData$conscientiousness_score[groupedData$Region=='North America']))
## [1] 0.01121378
var((groupedData$conscientiousness_score[groupedData$Region =='Asia & Pacific']))
## [1] 0.01093929
#

Iz testa homogenosti koji je nužan za Anovu možemo zaključiti da varijance nisu homogene, no opet uzevši u obzir količinu podataka i p vrijednost testa homogenosti smatrat ćemo varijance homogene.

#Grafički prikaz podataka

boxplot(groupedData$conscientiousness_score ~ (groupedData$Region))

Na temelju grafa pretpostavljamo da nulta hipoteza ne vrijedi …

#ANOVA test

an = aov(groupedData$conscientiousness_score ~ (groupedData$Region))
summary(an)
##                       Df Sum Sq Mean Sq F value Pr(>F)    
## groupedData$Region     2    7.2   3.593   321.7 <2e-16 ***
## Residuals          72331  808.0   0.011                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Iz podataka vidimo da hipoteza h0 ne vrijedi, već na gornjem grafu smo uočili razliku.

#Post hoc Dunnett’s test

library(DescTools)
DunnettTest(x=groupedData$conscientiousness_score, g=groupedData$Region)
## 
##   Dunnett's test for comparing several treatments with a control :  
##     95% family-wise confidence level
## 
## $`Asia & Pacific`
##                                      diff      lwr.ci       upr.ci    pval    
## Europe-Asia & Pacific        -0.020688203 -0.02283796 -0.018538443 2.5e-14 ***
## North America-Asia & Pacific  0.000475986 -0.00167539  0.002627362  0.8389    
## 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Iz navedenog vidimo da je populacija Azije manje savjesna od populacije Sjeverne Amerike.

LOGISTIČKA REGRESIJA Postoje li razlike u karakteristikama prema spolu ispitanika? Možemo li temeljem nekih drugih varijabli odrediti spol ispitanika?

Prvi dio hipoteze potoje li razlike u karektaristikama ćemo razmotriti tako što ćemo primjerice uzet u obzir karekteristike neuroticizam i otvorenost. Prvo ćemo prikazati histograme

bigFiveData$`standardized_sex` <- bigFiveData$sex-1
w = bigFiveData[which(bigFiveData$standardized_sex == 1),]
m = bigFiveData[which(bigFiveData$standardized_sex == 0),]

hist(m$`openness_score`)

hist(w$`openness_score`)

hist(m$`neuroticism_score`)

hist(w$`neuroticism_score`)

hist(m$`extraversion_score`)

hist(w$`extraversion_score`)

hist(m$`conscientiousness_score`)

hist(w$`conscientiousness_score`)

hist(m$`agreeable_score`)

hist(w$`agreeable_score`)

Iz histograma se čini da su razdiobe normalne.

qqnorm(m$`openness_score`, main=' ')
qqline(m$`openness_score`, col = "medium blue")

qqnorm(w$`openness_score`, main=' ')
qqline(w$`openness_score`, col = "medium blue ")

qqnorm(m$`neuroticism_score`, main=' ')
qqline(m$`neuroticism_score`, col = "medium blue")

qqnorm(w$`neuroticism_score`, main=' ')
qqline(w$`neuroticism_score`, col = "medium blue ")

qqnorm(m$`extraversion_score`, main=' ')
qqline(m$`extraversion_score`, col = "medium blue ")

qqnorm(w$`extraversion_score`, main=' ')
qqline(w$`extraversion_score`, col = "medium blue ")

qqnorm(m$`conscientiousness_score`, main=' ')
qqline(m$`conscientiousness_score`, col = "medium blue ")

qqnorm(w$`conscientiousness_score`, main=' ')
qqline(w$`conscientiousness_score`, col = "medium blue ")

qqnorm(m$`agreeable_score`, main=' ')
qqline(m$`agreeable_score`, col = "medium blue ")

qqnorm(w$`agreeable_score`, main=' ')
qqline(w$`agreeable_score`, col = "medium blue ")

Qqplot-ovi ukazuju na normalnost razdioba. Provjerit ćemo još normalnost pomoću Lillie testa.

nortest::lillie.test(w$openness_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  w$openness_score
## D = 0.019992, p-value < 2.2e-16
nortest::lillie.test(m$openness_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  m$openness_score
## D = 0.01799, p-value < 2.2e-16
nortest::lillie.test(w$neuroticism_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  w$neuroticism_score
## D = 0.018533, p-value < 2.2e-16
nortest::lillie.test(m$neuroticism_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  m$neuroticism_score
## D = 0.025863, p-value < 2.2e-16
nortest::lillie.test(w$extraversion_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  w$extraversion_score
## D = 0.032209, p-value < 2.2e-16
nortest::lillie.test(m$extraversion_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  m$extraversion_score
## D = 0.033713, p-value < 2.2e-16
nortest::lillie.test(w$conscientiousness_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  w$conscientiousness_score
## D = 0.022818, p-value < 2.2e-16
nortest::lillie.test(m$conscientiousness_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  m$conscientiousness_score
## D = 0.014015, p-value < 2.2e-16
nortest::lillie.test(w$agreeable_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  w$agreeable_score
## D = 0.040539, p-value < 2.2e-16
nortest::lillie.test(m$agreeable_score)
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  m$agreeable_score
## D = 0.038861, p-value < 2.2e-16

Provest cemo t-test za otvorenost: H0: Muškarci su jednako otvoreni kao i žene odnosno ne postoji razlika u otvorenosti kod muškaraca i žena.

var.test(w$`openness_score`, m$`openness_score`)
## 
##  F test to compare two variances
## 
## data:  w$openness_score and m$openness_score
## F = 0.90724, num df = 185148, denom df = 122163, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8980113 0.9165500
## sample estimates:
## ratio of variances 
##          0.9072382
t.test(w$`openness_score`, m$`openness_score`, 
       alt = "greater", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  w$openness_score and m$openness_score
## t = 55.37, df = 307311, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.0172636       Inf
## sample estimates:
## mean of x mean of y 
## 0.7410141 0.7232220

Provest cemo t-test i za neuroticnost: H0: Muškarci su jednako neurotični kao i žene odnosno ne postoji razlika u neurotičnosti kod muškaraca i žena.

var.test(w$`neuroticism_score`, m$`neuroticism_score`)
## 
##  F test to compare two variances
## 
## data:  w$neuroticism_score and m$neuroticism_score
## F = 0.96421, num df = 185148, denom df = 122163, p-value = 2.546e-12
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9544022 0.9741051
## sample estimates:
## ratio of variances 
##          0.9642086
t.test(w$`neuroticism_score`, m$`neuroticism_score`, 
       alt = "greater", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  w$neuroticism_score and m$neuroticism_score
## t = 111.07, df = 307311, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.04973036        Inf
## sample estimates:
## mean of x mean of y 
## 0.5944653 0.5439874

Provest cemo t-test za savjesnost: H0: Muškarci su jednako savjesni kao i žene odnosno ne postoji razlika u savjesnosti kod muškaraca i žena.

var.test(w$`conscientiousness_score`, m$`conscientiousness_score`)
## 
##  F test to compare two variances
## 
## data:  w$conscientiousness_score and m$conscientiousness_score
## F = 0.9671, num df = 185148, denom df = 122163, p-value = 1.321e-10
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.9572659 0.9770279
## sample estimates:
## ratio of variances 
##          0.9671017
t.test(w$`conscientiousness_score`, m$`conscientiousness_score`, 
       alt = "greater", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  w$conscientiousness_score and m$conscientiousness_score
## t = 24.908, df = 307311, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.009198189         Inf
## sample estimates:
## mean of x mean of y 
## 0.7059135 0.6960649

Provest cemo t-test za ekstroverziju: H0: Muškarci su jednako ekstrovertni kao i žene odnosno ne postoji razlika u ekstroverziji kod muškaraca i žena.

var.test(w$`extraversion_score`, m$`extraversion_score`)
## 
##  F test to compare two variances
## 
## data:  w$extraversion_score and m$extraversion_score
## F = 0.8961, num df = 185148, denom df = 122163, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8869908 0.9053020
## sample estimates:
## ratio of variances 
##          0.8961045
t.test(w$`extraversion_score`, m$`extraversion_score`, 
       alt = "greater", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  w$extraversion_score and m$extraversion_score
## t = 43.539, df = 307311, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.01673124        Inf
## sample estimates:
## mean of x mean of y 
## 0.6792409 0.6618528

Provest cemo t-test za slaganje tj pristanak na kompromis: H0: Muškarci su jednako spremni na slaganje i kompromis kao i žene odnosno ne postoji razlika kod muškaraca i žena.

var.test(w$`agreeable_score`, m$`agreeable_score`)
## 
##  F test to compare two variances
## 
## data:  w$agreeable_score and m$agreeable_score
## F = 0.879, num df = 185148, denom df = 122163, p-value < 2.2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.8700581 0.8880198
## sample estimates:
## ratio of variances 
##          0.8789978
t.test(w$`agreeable_score`, m$`agreeable_score`, 
       alt = "greater", var.equal = TRUE)
## 
##  Two Sample t-test
## 
## data:  w$agreeable_score and m$agreeable_score
## t = 129.15, df = 307311, p-value < 2.2e-16
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.04282648        Inf
## sample estimates:
## mean of x mean of y 
## 0.7140516 0.6706726

Vidimo da je p-vrijednost jako mala stoga možemo zaključiti da postoje razlike u karakteristikama kod muškaraca, odnosno kod žena.

Nastavno na gore provedene testove, prirodno se nameće pitanje, možemo li na osnovu ocjene karakteristika odrediti spol.Prvo ćemo pogledati prosječne ocjene karakteristika kod žena odnosno kod muškaraca.

bigFiveData$`standardized_sex` <- bigFiveData$sex-1
w = bigFiveData[which(bigFiveData$standardized_sex == 1),]
m = bigFiveData[which(bigFiveData$standardized_sex == 0),]

cat('Prosječna ocjena otvorenosti žena je ', mean(w$openness_score), '\n')
## Prosječna ocjena otvorenosti žena je  0.7410141
cat('Prosječna ocjena otvornosti muškaraca je ', mean(m$openness_score), '\n')
## Prosječna ocjena otvornosti muškaraca je  0.723222
cat('Prosječna ocjena neuroticizma žena je ', mean(w$neuroticism_score), '\n')
## Prosječna ocjena neuroticizma žena je  0.5944653
cat('Prosječna ocjena neuroticizma muškaraca je ', mean(m$neuroticism_score), '\n')
## Prosječna ocjena neuroticizma muškaraca je  0.5439874
cat('Prosječna ocjena ekstraverzije žena je ', mean(w$extraversion_score), '\n')
## Prosječna ocjena ekstraverzije žena je  0.6792409
cat('Prosječna ocjena ekstraverzije muškaraca je ', mean(m$extraversion_score), '\n')
## Prosječna ocjena ekstraverzije muškaraca je  0.6618528
cat('Prosječna ocjena slaganja žena je ', mean(w$agreeable_score), '\n')
## Prosječna ocjena slaganja žena je  0.7140516
cat('Prosječna ocjena slaganja muškaraca je ', mean(m$agreeable_score), '\n')
## Prosječna ocjena slaganja muškaraca je  0.6706726
cat('Prosječna ocjena savjesnosti žena je ', mean(w$conscientiousness_score), '\n')
## Prosječna ocjena savjesnosti žena je  0.7059135
cat('Prosječna ocjena savjesnosti muškaraca je ', mean(m$conscientiousness_score), '\n')
## Prosječna ocjena savjesnosti muškaraca je  0.6960649

Iz ovoga vidimo da postoje razlike izmedju muškaraca i žena, ne toliko velike ali postoje. Provodimo logističku regresiju:

logreg_model = glm(standardized_sex ~ neuroticism_score + openness_score + extraversion_score + conscientiousness_score + agreeable_score , data=bigFiveData, family = binomial())
summary(logreg_model)
## 
## Call:
## glm(formula = standardized_sex ~ neuroticism_score + openness_score + 
##     extraversion_score + conscientiousness_score + agreeable_score, 
##     family = binomial(), data = bigFiveData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8707  -1.0882   0.6130   0.9313   2.9619  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             -13.35816    0.07622 -175.26   <2e-16 ***
## neuroticism_score         7.24558    0.04400  164.68   <2e-16 ***
## openness_score            0.91112    0.04819   18.91   <2e-16 ***
## extraversion_score        4.35782    0.04383   99.43   <2e-16 ***
## conscientiousness_score   2.61600    0.04483   58.35   <2e-16 ***
## agreeable_score           6.09270    0.04877  124.93   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 413025  on 307312  degrees of freedom
## Residual deviance: 362792  on 307307  degrees of freedom
## AIC: 362804
## 
## Number of Fisher Scoring iterations: 3
yHat <- logreg_model$fitted.values > 0.5
conf_table <- table(bigFiveData$standardized_sex, yHat)

conf_table
##    yHat
##      FALSE   TRUE
##   0  57695  64469
##   1  31297 153852
cat("Accuracy: ", sum(diag(conf_table)) / sum(conf_table), "\n")
## Accuracy:  0.6883763
cat("Precision: ", conf_table[2,2] / sum(conf_table[,2]),"\n")
## Precision:  0.7047055
cat("Recall: ", conf_table[2,2] / sum(conf_table[2,]),"\n")
## Recall:  0.8309632
cat("Specificity: ", conf_table[1,1] / sum(conf_table[,1]),"\n")
## Specificity:  0.6483167
cat("\nRsq: ", 1 - logreg_model$deviance/logreg_model$null.deviance)
## 
## Rsq:  0.1216221

LINEARNA REGRESIJA - Ovisnost dobi o faktorima koje imamo na raspolaganju

U posljednjem testu promatrat ćemo varijablu dobi (starosti) i uspoređivati ju sa svih 5 faktora koje ispitujemo. Prvo ćemo vidjeti povezanost dobi sa svim tim faktorima, nakon čega ćemo izabrati jedan na kojem ćemo raditi linearnu regresiju te na temelju tog faktora pokušati odrediti dob ispitanika.

Pa krenimo s testiranjem. Za početak ćemo podijeliti varijablu dobi, koja je numerička, na kategorije: mlađi(0-15 godina), srednje dobi(16-30 godina) i starije(30+ godina)..

young = bigFiveData[which(bigFiveData$age <= 15),]
middle = bigFiveData[which(bigFiveData$age > 15 & bigFiveData$age <= 30),]
old = bigFiveData[which(bigFiveData$age > 30),]

Nakon što smo podijelili podatke o dobi na kategorije izračunati ćemo prosječne srednje vrijednosti, prikazati box plotove, te scatter plotove s regresijskom linijom i na temelju tih svih podataka zaključiti koji je faktor najpovezaniji s dobi. S izabranim faktorom napravit ćemo linearnu regresiju te na temelju rezultata tog faktora pokušati odrediti dob.

Prvo ćemo napraviti linearne modele za svaki faktor da bi mogli “nacrtati” liniju na scatter plotu.

fit.opennes = lm(age~openness_score,data=bigFiveData)
fit.extraversion = lm(age~extraversion_score,data=bigFiveData)
fit.agreeable = lm(age~agreeable_score,data=bigFiveData)
fit.conscientiousness = lm(age~conscientiousness_score,data=bigFiveData)
fit.neuroticism = lm(age~neuroticism_score,data=bigFiveData)

EKTROVERZIJA

cat('Prosječna ocjena ekstraverzije mladih ljudi iznosi ', mean(young$extraversion_score), '\n')
## Prosječna ocjena ekstraverzije mladih ljudi iznosi  0.6924383
cat('Prosječna ocjena ekstraverzije srednjih ljudi iznosi ', mean(middle$extraversion_score), '\n')
## Prosječna ocjena ekstraverzije srednjih ljudi iznosi  0.6774276
cat('Prosječna ocjena ekstraverzije starijih ljudi iznosi ', mean(old$extraversion_score), '\n')
## Prosječna ocjena ekstraverzije starijih ljudi iznosi  0.6503976
boxplot(young$extraversion_score, middle$extraversion_score, old$extraversion_score,
        names = c('Young people extraversion score','Middle aged people extraversion score', 'Old people extraversion score'),
        main = 'Boxplot of young, middle aged and old people extraversion score')

smoothScatter(bigFiveData$extraversion_score, bigFiveData$age, main = "Scatter plot for extraversion score and age") #graficki prikaz podataka 
lines(bigFiveData$extraversion_score,fit.extraversion$fitted.values,col="red") #linija

UGODNOST

cat('Prosječna ocjena ugodnosti mladih ljudi iznosi ', mean(young$agreeable_score), '\n')
## Prosječna ocjena ugodnosti mladih ljudi iznosi  0.6671362
cat('Prosječna ocjena ugodnosti srednjih ljudi iznosi ', mean(middle$agreeable_score), '\n')
## Prosječna ocjena ugodnosti srednjih ljudi iznosi  0.6918796
cat('Prosječna ocjena ugodnosti starijih ljudi iznosi ', mean(old$agreeable_score), '\n')
## Prosječna ocjena ugodnosti starijih ljudi iznosi  0.7210159
boxplot(young$agreeable_score, middle$agreeable_score, old$agreeable_score,
        names = c('Young people agreeable score','Middle aged people agreeable score', 'Old people agreeable score'),
        main = 'Boxplot of young, middle aged and old people agreeable score')

smoothScatter(bigFiveData$agreeable_score, bigFiveData$age, main = "Scatter plot for agreeable score and age") #graficki prikaz podataka 
lines(bigFiveData$agreeable_score,fit.agreeable$fitted.values,col="red") #linija

SAVJESNOST

cat('Prosječna ocjena savjesnosti mladih ljudi iznosi ', mean(young$conscientiousness_score), '\n')
## Prosječna ocjena savjesnosti mladih ljudi iznosi  0.6476224
cat('Prosječna ocjena savjesnosti srednjih ljudi iznosi ', mean(middle$conscientiousness_score), '\n')
## Prosječna ocjena savjesnosti srednjih ljudi iznosi  0.6936849
cat('Prosječna ocjena savjesnosti starijih ljudi iznosi ', mean(old$conscientiousness_score), '\n')
## Prosječna ocjena savjesnosti starijih ljudi iznosi  0.7441107
boxplot(young$conscientiousness_score, middle$conscientiousness_score, old$conscientiousness_score,
        names = c('Young people conscientiousness score','Middle aged people conscientiousness score', 'Old people conscientiousness score'),
        main = 'Boxplot of young, middle aged and old people conscientiousness score')

smoothScatter(bigFiveData$conscientiousness_score, bigFiveData$age, main = "Scatter plot for conscientiousness score and age") #graficki prikaz podataka 
lines(bigFiveData$conscientiousness_score,fit.conscientiousness$fitted.values,col="red") #linija

NEUROTICIZAM

cat('Prosječna ocjena neuroticizma mladih ljudi iznosi ', mean(young$neuroticism_score), '\n')
## Prosječna ocjena neuroticizma mladih ljudi iznosi  0.5960167
cat('Prosječna ocjena neuroticizma srednjih ljudi iznosi ', mean(middle$neuroticism_score), '\n')
## Prosječna ocjena neuroticizma srednjih ljudi iznosi  0.5797046
cat('Prosječna ocjena neuroticizma starijih ljudi iznosi ', mean(old$neuroticism_score), '\n')
## Prosječna ocjena neuroticizma starijih ljudi iznosi  0.5513757
boxplot(young$neuroticism_score, middle$neuroticism_score, old$neuroticism_score,
        names = c('Young people neuroticism score','Middle aged people neuroticism score', 'Old people neuroticism score'),
        main = 'Boxplot of young, middle aged and old people neuroticism score')

smoothScatter(bigFiveData$neuroticism_score, bigFiveData$age, main = "Scatter plot for neuroticism score and age") #graficki prikaz podataka 
lines(bigFiveData$neuroticism_score,fit.neuroticism$fitted.values,col="red") #linija

OTVORENOST NOVIM ISKUSTVIMA

cat('Prosječna ocjena otvorenosti mladih ljudi iznosi ', mean(young$openness_score), '\n')
## Prosječna ocjena otvorenosti mladih ljudi iznosi  0.7281298
cat('Prosječna ocjena otvorenosti srednjih ljudi iznosi ', mean(middle$openness_score), '\n')
## Prosječna ocjena otvorenosti srednjih ljudi iznosi  0.7343365
cat('Prosječna ocjena otvorenosti starijih ljudi iznosi ', mean(old$openness_score), '\n')
## Prosječna ocjena otvorenosti starijih ljudi iznosi  0.7344109
boxplot(young$openness_score, middle$openness_score, old$openness_score,
        names = c('Young people openness score','Middle aged people openness score', 'Old people openness score'),
        main = 'Boxplot of young, middle aged and old people openness score')

smoothScatter(bigFiveData$openness_score, bigFiveData$age,main = 'Scatter plot for openness scores and age') #graficki prikaz podataka 
lines(bigFiveData$openness_score,fit.opennes$fitted.values,col="red") #linija

Nakon što smo pogledali prosjeke te napravili box plotove i scatter plotove s regresijskim linijama, bez računanja testova za svaki faktor posebno, možemo vidjeti da je savjesnost najviše ovisna o dobi. Zbog toga sada biramo savjesnost i radimo daljnja testiranja za dob vs savjesnost ispitanika.

Prije svakog testiranja gdje pretpostavljamo normalnost, moramo ju pokazati. Za početak ćemo prikazati podatke u histogramu.

Histogram za savjesnost mlađih:

hist(young$conscientiousness_score, main='Younger people conscientiousness score', xlab='Conscientiousness score', ylab='Frequency')

Histogram za savjesnost srednjh:

hist(middle$conscientiousness_score, main='Middle aged people conscientiousness score', xlab='Conscientiousness score', ylab='Frequency')

Histogram za savjesnost starijih:

hist(old$conscientiousness_score, main='Older people conscientiousness score', xlab='Conscientiousness score', ylab='Frequency')

Prije linearne regresiju trebamo provjeriti normalnost reziduala(razlika..) i homogenost varijance.

Normalnost reziduala provjerit ćemo grafički, pomoću q-q plota te statistički pomoću Kolmogorov-Smirnovljevog testa.

selected.model = fit.conscientiousness
#smoothScatter(selected.model$residuals)

Histogram:

hist((selected.model$residuals))

hist(rstandard(selected.model))

Q-Q plot:

 #q-q plot reziduala s linijom normalne distribucije
qqnorm(rstandard(selected.model)) 
qqline(rstandard(selected.model))

Na danom q-q plotu i na histogramima vidimo da reziduali nisu normalno distribuirani. Mičemo outliere gledajući varijablu godine i varijablu savjesnosti da vidimo dobivamo li “normalnije” rezultate…

Q1 <- quantile(bigFiveData$conscientiousness_score, .25)
Q3 <- quantile(bigFiveData$conscientiousness_score, .75)
IQR <- IQR(bigFiveData$conscientiousness_score)
#Q1 <- quantile(bigFiveData$age, .25)
#Q3 <- quantile(bigFiveData$age, .75)
#IQR <- IQR(bigFiveData$age)
bigFiveDatatemp <- subset(bigFiveData, bigFiveData$conscientiousness_score> (Q1 - 1.5*IQR) & bigFiveData$conscientiousness_score< (Q3 + 1.5*IQR))
Q1 <- quantile(bigFiveDatatemp$age, .25)
Q3 <- quantile(bigFiveDatatemp$age, .75)
IQR <- IQR(bigFiveDatatemp$age)
bigFiveDataNew <- subset(bigFiveDatatemp, bigFiveDatatemp$age> (Q1 - 1.5*IQR) & bigFiveDatatemp$age< (Q3 + 1.5*IQR))
dim(bigFiveDataNew)
## [1] 287789     10

Na temelju danih rezultata usporedbe godina sa svakim od 5 faktora, zaključili smo da ni za jedan od njih nema smisla raditi linearnu regresiju.. izabrali smo naizgled najlogičniji, ali prema dobivenom q-q plotu je očito da podaci nisu normalno distribuirani, pa ćemo propbati nešto novo. Na žalost ovaj test je “propao”.

Probat ćemo za početak obrnuto.. gledat ćemo kako savjesnot ispitanika ovisi o dobi.. (do sada smo gledali kako dob ovisi o savjesnosti i ostalim faktorima)

Zavisna varijabla: Y = conscientiousness_score Nezavisna varijabla: X = age

Radimo novi linearni model:

fit.conscientiousness_age_new = lm(conscientiousness_score~age, data = bigFiveDataNew)
fit.conscientiousness_age = lm(conscientiousness_score~age, data = bigFiveData)

Smooth Scatter plot za ovisnost savjesnosti o dobi:

smoothScatter(bigFiveData$age, bigFiveData$conscientiousness_score, main = "Scatter plot for conscientiousness score and age") #graficki prikaz podataka 
lines(bigFiveData$age,fit.conscientiousness_age$fitted.values,col="red") #linija

Sada radimo qq-plot za taj model:

qqnorm(rstandard(fit.conscientiousness_age))
qqline(rstandard(fit.conscientiousness_age))

Q-q plot daje dobre rezultate za razliku od prethodnog. Možemo naslutiti da se radi o nomalno distribuiranim podacima. Osim q-q plota, možemo napraviti i histogram za reziduale da grafički pokazećemo normalnost:

hist((fit.conscientiousness_age$residuals))

hist(rstandard(fit.conscientiousness_age))

S obzirom da Q-Q plot i histogrami nisu dovoljni za pokazivanje normalnosti, sada ćemo na danom modelu provesti statističke testove (KS i Lilliefors)

Kolmogorov Smirnovljev test:

ks.test(rstandard(fit.conscientiousness_age),"pnorm")
## Warning in ks.test(rstandard(fit.conscientiousness_age), "pnorm"): ties should
## not be present for the Kolmogorov-Smirnov test
## 
##  One-sample Kolmogorov-Smirnov test
## 
## data:  rstandard(fit.conscientiousness_age)
## D = 0.012243, p-value < 2.2e-16
## alternative hypothesis: two-sided
require(nortest)

Lillieforsov test:

lillie.test(rstandard(fit.conscientiousness_age))
## 
##  Lilliefors (Kolmogorov-Smirnov) normality test
## 
## data:  rstandard(fit.conscientiousness_age)
## D = 0.012243, p-value < 2.2e-16

Ovim statističkim testovima pokazali smo normalnost pa možemo koristiti te podatke i linearni model u daljnjem razmatranju.

Sada ćemo vidjeti kvalitetu prilagodbe modela podacima:

summary(fit.conscientiousness_age)
## 
## Call:
## lm(formula = conscientiousness_score ~ age, data = bigFiveData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60255 -0.06908  0.00282  0.07230  0.32423 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.301e-01  5.061e-04  1245.0   <2e-16 ***
## age         2.853e-03  1.867e-05   152.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1035 on 307311 degrees of freedom
## Multiple R-squared:  0.07059,    Adjusted R-squared:  0.07059 
## F-statistic: 2.334e+04 on 1 and 307311 DF,  p-value: < 2.2e-16

Možemo za kraj, nakon ovih testova pogledati i korelaciju među tim varijablama da potvrdimo da nisu linearno povezane.

cor(bigFiveDataNew$age, bigFiveDataNew$conscientiousness_score)
## [1] 0.2468758
cor.test(bigFiveDataNew$conscientiousness_score, bigFiveDataNew$age)
## 
##  Pearson's product-moment correlation
## 
## data:  bigFiveDataNew$conscientiousness_score and bigFiveDataNew$age
## t = 136.67, df = 287787, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2434418 0.2503035
## sample estimates:
##       cor 
## 0.2468758

Iz navedenih rezultata testa, po R-squared koji iznosi 6 % (u ovom slučaju kada smo makli outliere i po dobi i po savjesnosti).. Inače, da bi model za linearnu regresiju bio dobar, za neke socijalne eksperminte, zadovoljili bi se i s 30-40% sličnosti, ali 6 % je jako malo..

Nakon ovog neuspješnog testa, tj. odbijanja početne hipoteze da možemo savjesnost predviditi pomoću dobi ili obrnuto, u igru ćemo probati ubaciti još neki regresor o kojem ovisi savjesnost, te time pokušati pokazati linearnu povezanost.

Prije nego odlučimo koje ćemo varijable uzeti za regresore, pogledat ćemo koefijcijente koreliranosti. Inače, za višestruku regresiju ne smijemo koristiti varijable koje su jako korelirane jer mogu stvoriti probleme u interpretaciji rezultata.

Koeficijent korelacije godina i spola:

cor(bigFiveData$age, bigFiveData$sex)
## [1] -0.002110297
#cor.test(bigFiveData$age, bigFiveData$sex)

Vidimo da ove dvije varijable nisu korelirane, stoga ih možemo koristiti kao regresore u višestrukoj regresiji.

Radimo novi linearni model.

fit.conscientiousness_age_sex = lm(conscientiousness_score~age + sex, data = bigFiveData)

Rezultati provjere kvalitete modela:

summary(fit.conscientiousness_age_sex)
## 
## Call:
## lm(formula = conscientiousness_score ~ age + sex, data = bigFiveData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.59662 -0.06880  0.00267  0.07180  0.33024 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 6.141e-01  7.936e-04  773.82   <2e-16 ***
## age         2.854e-03  1.865e-05  153.00   <2e-16 ***
## sex         9.972e-03  3.812e-04   26.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1034 on 307310 degrees of freedom
## Multiple R-squared:  0.07266,    Adjusted R-squared:  0.07265 
## F-statistic: 1.204e+04 on 2 and 307310 DF,  p-value: < 2.2e-16

Nakon što smo napravili par teatova i vidjeli da je rezultat za R-squared malen što god uzeli, probat ćemo s random manjim uzorkom, jer kod R-squared može biti problem velika količina podataka jer on gleda varijancu pogresške.. Prvo pokušavamo za višestruku regresiju s godinama i spolom na jednoj, i savjesnošću na drugoj strani.

fit.conscientiousness_age_sex_test = lm(conscientiousness_score~age + sex, data = sample_n(bigFiveData, 100))
summary(fit.conscientiousness_age_sex_test)
## 
## Call:
## lm(formula = conscientiousness_score ~ age + sex, data = sample_n(bigFiveData, 
##     100))
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.299054 -0.075044  0.006606  0.081109  0.248949 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.603271   0.052537  11.483   <2e-16 ***
## age         0.001997   0.001259   1.586    0.116    
## sex         0.021925   0.024358   0.900    0.370    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1127 on 97 degrees of freedom
## Multiple R-squared:  0.03314,    Adjusted R-squared:  0.0132 
## F-statistic: 1.662 on 2 and 97 DF,  p-value: 0.1951

Scatter plot navedenog primjera:

plot(fit.conscientiousness_age_sex_test$fitted.values,fit.conscientiousness_age_sex_test$residuals)

Nakon toga probat ćemo početnu ideju, jednostavnu linearnu regresiju za godine vs savjesnost.

fit.conscientiousness_random = lm(conscientiousness_score~age, data = sample_n(bigFiveData, 100))
summary(fit.conscientiousness_random)
## 
## Call:
## lm(formula = conscientiousness_score ~ age, data = sample_n(bigFiveData, 
##     100))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34598 -0.08079 -0.00716  0.08642  0.25443 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.629476   0.031591   19.93   <2e-16 ***
## age         0.002799   0.001106    2.53    0.013 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1158 on 98 degrees of freedom
## Multiple R-squared:  0.06131,    Adjusted R-squared:  0.05173 
## F-statistic: 6.401 on 1 and 98 DF,  p-value: 0.013

Scatter plot:

plot(fit.conscientiousness_random$fitted.values,fit.conscientiousness_random$residuals)

U ovom slučaju promatrali smo uzorak od 100 i dobili smo da je R-squared 25% što je u slučaju promatranja ljudi i ljudskog ponašanja dobro, pa možemo zaključiti da je pretpostavka dobra i rezultat značajan.

Na kraju cijelog testa, nakon što smo uglavnom grafički provjeravali ovisnosti svih faktora o godinama, napravili smo linearni model s godinama kao regresorom i koeficijentom savjesnosti kao nezavnisnom varijablom i dobili dobar rezultat. Možemo zaključiti da su “stariji ljudi više savjesni”, što je donekle i logično jer čovjek odrastanjem posteje svjesniji svojih obveza i odgovorniji prema njima.